37 MANOVA and MANCOVA - Practical

37.1 Introduction

In this practical we begin by walking-through a simple example of MANOVA.
You’ll then have the opportunity to apply a MANOVA on a different dataset.
We’ll also walk-through an example of MANCOVA, which will include exploring the influence of covariates on group-based differences on multiple outcome variables.
Finally, you’ll have the opportunity to apply a MANCOVA on a different dataset.

37.2 MANOVA: demonstration

You’ll remember that MANOVA is used when we want to identify group (or sample) based differences in two or more dependent variables. It extends the principle of ANOVA, which can only deal with one dependent variable.

Load data

First, we’ll create a dataframe df by downloading a .csv file.

rm(list=ls())
df <- read.csv('https://www.dropbox.com/scl/fi/jjh1sohgsv590f04x75pv/29_01.csv?rlkey=vhlxu0pziwymec3n7fiki2x1x&dl=1')

head(df) # display the first six rows

  X athlete_id training_program gender age sprint_time jump_height
1 1          1        Program_A   Male  20       11.79        1.81
2 2          2        Program_A   Male  20       11.77        1.72
3 3          3        Program_A   Male  20       11.33        1.83
4 4          4        Program_A Female  22        9.99        2.05
5 5          5        Program_A   Male  20       10.88        2.80
6 6          6        Program_A Female  22       10.72        1.96
  throw_distance
1          60.15
2          30.07
3          45.73
4          51.17
5          41.07
6          53.34

In this dataframe, each observation (row) represents an athlete. We can see that each athlete has been allocated to one of three training programs (A, B or C).

This is the independent variable: the group to which the athlete has been allocated.

We can also see that their performance on three different outcome variables has been measured (sprint_time, jump_height, and throw_distance).

These are the dependent variables: the outcome measures we have collected.

Explore the Data

It’s always good practice to explore our data before we begin any complicated analyses.

In this case, we’re particularly interested in what the means of each group look like, for each of our three dependent variables.

We can use the dplyr package to examine differences in the mean of each outcome variable by group.

library(dplyr)

group_summary <- df %>%
  group_by(training_program) %>%
  summarise(
    Mean_Sprint_Time = mean(sprint_time),
    SD_Sprint_Time = sd(sprint_time),
    Mean_Jump_Height = mean(jump_height),
    SD_Jump_Height = sd(jump_height),
    Mean_Throw_Distance = mean(throw_distance),
    SD_Throw_Distance = sd(throw_distance)
  )
print(group_summary)

# A tibble: 3 × 7
  training_program Mean_Sprint_Time SD_Sprint_Time Mean_Jump_Height
  <chr>                       <dbl>          <dbl>            <dbl>
1 Program_A                    11.1          0.948             2.06
2 Program_B                    12.1          0.955             2.10
3 Program_C                    12.9          0.909             2.03
# ℹ 3 more variables: SD_Jump_Height <dbl>, Mean_Throw_Distance <dbl>,
#   SD_Throw_Distance <dbl>

We can immediately see that there are between-group differences in the means for all the dependent variables. For example, athletes in Program_A have a lower mean sprint_time than those in Program_B.

We could also create some visuals to help explore the data:

# Sprint Time Boxplot
boxplot(sprint_time ~ training_program, data = df,
        main = "Sprint Time by Training Program",
        xlab = "Group", ylab = "Sprint Time (seconds)")

# Jump Height Boxplot
boxplot(jump_height ~ training_program, data = df,
        main = "Jump Height by Training Program",
        xlab = "Group", ylab = "Jump Height (meters)")

# Throw Distance Boxplot
boxplot(throw_distance ~ training_program, data = df,
        main = "Throw Distance by Training Program",
        xlab = "Group", ylab = "Throw Distance (meters)")

These boxplots confirm what we saw earlier. It looks like athletes in Program A have a lower sprint_time than the other groups. However, we can’t yet say with confidence that these differences are significant (i.e., meaningful).

We need a statistical test to see if these differences are ‘real’.

Run MANOVA

We’re now ready to examine the effect of our independent variable, training_program, on the three dependent variables (sprint_time, jump_height and throw_distance).

Remember, we are hypothesising that an athlete’s performance on each of these variables depends on which level of the independent variable they belong to.

Here’s the R code to run the MANOVA:

# Conduct MANOVA
manova_fit <- manova(cbind(sprint_time, jump_height, throw_distance) ~ training_program, data = df)

# Get summary of the MANOVA
summary(manova_fit)

                 Df Pillai approx F num Df den Df    Pr(>F)    
training_program  2 0.3895   7.7393      6    192 1.829e-07 ***
Residuals        97                                            
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.aov(manova_fit) # For univariate effects

 Response sprint_time :
                 Df Sum Sq Mean Sq F value    Pr(>F)    
training_program  2 51.420 25.7098  29.255 1.143e-10 ***
Residuals        97 85.244  0.8788                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response jump_height :
                 Df  Sum Sq Mean Sq F value Pr(>F)
training_program  2  0.0734 0.03672  0.1467 0.8638
Residuals        97 24.2867 0.25038               

 Response throw_distance :
                 Df  Sum Sq Mean Sq F value Pr(>F)
training_program  2   108.3  54.158  0.4995 0.6084
Residuals        97 10516.3 108.416

The second line of the output (starting with ‘training_program’) suggests that there are some significant between-group differences on at least one of the outcome variables. Notice the three *’s at the end of the line, which indicates p < 0.001.

In other words, athletes in different training programmes (A,B and C) appear to perform significantly differently on at least one outcome.

However, these first lines tell us nothing about the nature of that difference, or where it occurs.

Note

At this point, if you weren’t seeing a significant result in this first part of the output, you’d conclude that there are no significant differences as a result of training programme on any of the dependent variables.

As we have found a significant overall result, we need to explore this further.

The second part of our output tells us where those differences lie: it is in sprint time, as we suspected (again, notice the *’s indicating significance for this dependent variable).

So there are significant group-based differences for sprint time, but not for jump height (p>0.1) or training program (p>0.1).

Tukey’s ‘Honestly Significant Difference’ (HSD)

If we find a significant between-group difference for one or more of our dependent variables, we would typically use pairwise comparisons to examine the differences between levels of the independent variable for each dependent variable.

Remember, at this point, we don’t know which groups are significantly different from each other.

One common approach is to use Tukey’s Honestly Significant Difference (HSD) test. This test controls for the family-wise error rate when conducting multiple pairwise comparisons.

Notice that I’m only doing this for sprint_time, because we already know there are no significant differences between groups for our other dependent variables.

What is family-wise error rate?

FWER refers to the probability of making one or more false discoveries, or Type I errors, when performing multiple hypothesis tests simultaneously.

It’s the likelihood of incorrectly rejecting at least one true null hypothesis in a set of comparisons. Controlling the FWER is crucial in studies involving multiple comparisons to ensure the reliability of results, as it helps to maintain the overall error rate at a desired level, thereby reducing the chance of drawing incorrect conclusions from the data.

# Load necessary libraries
library(dplyr)
library(multcomp)

# Conducting Post-Hoc Tests
# Tukey's HSD for Sprint_Time
tukey_sprint <- TukeyHSD(aov(sprint_time ~ training_program, data = df))
print("Tukey's HSD for Sprint_Time:")

[1] "Tukey's HSD for Sprint_Time:"

print(tukey_sprint)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = sprint_time ~ training_program, data = df)

$training_program
                         diff       lwr      upr     p adj
Program_B-Program_A 0.9771569 0.4318958 1.522418 0.0001364
Program_C-Program_A 1.7471569 1.2018958 2.292418 0.0000000
Program_C-Program_B 0.7700000 0.2206849 1.319315 0.0034206

This output confirms significant between-group differences for all training programmes on sprint time. In each case, p < 0.01.

We can go back to our table of means to interpret this; Program A had a mean sprint time of 11.1, B had a mean of 12.1 and C had a mean of 12.9.

Athletes in A had a significantly better performance than those in B and C, and athletes in B had a significantly better performance than those in C.

Therefore, at the end of our analysis, we can conclude that the training programs had a significant effect on an athlete’s sprint time, but had no significant effect on their jump height or throw distance.

I wonder if you can reflect on why this might be useful within an applied sport data analytics situation? What could you do with such information?

37.3 MANOVA: Practice

Using the data file available at the following link, repeat the analysis above. Take time to examine the data before undertaking the MANOVA.

rm(list=ls())
df_2 <- read.csv('https://www.dropbox.com/scl/fi/3d5itbmyp5wgqbkajhctm/basketball_dataset.csv?rlkey=6wrmvey8uhvu67egh8g8twnmr&dl=1')

Download the datafile and create a new dataframe. Explore the dataframe to determine what it represents.
Conduct some exploratory analysis of the data, including group-based differences.
Conduct a MANOVA to examine whether there are any significant differences in the dependent variables between groups.
If you find there are, explore your model further to determine where those differences lie.
If you identify significant differences, utilise Tukey’s HSD to pinpoint where those differences are found.

Show solution

#----------------------
# data loading
#----------------------
df_2 <- read.csv('https://www.dropbox.com/scl/fi/3d5itbmyp5wgqbkajhctm/basketball_dataset.csv?rlkey=6wrmvey8uhvu67egh8g8twnmr&dl=1')

#----------------------
# data preparation
#----------------------
df_2$X <- NULL
str(df_2)

'data.frame':   200 obs. of  3 variables:
 $ group        : chr  "Team A" "Team A" "Team A" "Team A" ...
 $ points_scored: int  86 82 64 81 79 64 74 93 87 85 ...
 $ assists      : int  10 11 7 4 11 0 16 10 11 10 ...

Show solution

df_2$group <- as.factor(df_2$group)
str(df_2)

'data.frame':   200 obs. of  3 variables:
 $ group        : Factor w/ 4 levels "Team A","Team B",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ points_scored: int  86 82 64 81 79 64 74 93 87 85 ...
 $ assists      : int  10 11 7 4 11 0 16 10 11 10 ...

Show solution

#----------------------
# data visualisation
#----------------------

library(ggplot2)

# Boxplot for points scored by group
ggplot(df_2, aes(x = group, y = points_scored, fill = group)) +
  geom_boxplot() +
  labs(title = "Boxplot of Points Scored by Group",
       x = "Group",
       y = "Points Scored") +
  theme_minimal()

Show solution

# Boxplot for assists by group
ggplot(df_2, aes(x = group, y = assists, fill = group)) +
  geom_boxplot() +
  labs(title = "Boxplot of Assists by Group",
       x = "Group",
       y = "Assists") +
  theme_minimal()

Show solution

#----------------
# Conduct MANOVA
#-----------------
manova_fit <- manova(cbind(points_scored, assists) ~ group, data = df_2)

# Get summary of the MANOVA
summary(manova_fit)

           Df  Pillai approx F num Df den Df    Pr(>F)    
group       3 0.17806    6.385      6    392 1.982e-06 ***
Residuals 196                                             
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show solution

summary.aov(manova_fit) # For univariate effects

 Response points_scored :
             Df  Sum Sq Mean Sq F value Pr(>F)
group         3   551.6 183.880  1.9175  0.128
Residuals   196 18795.1  95.893               

 Response assists :
             Df Sum Sq Mean Sq F value    Pr(>F)    
group         3  656.0 218.672  10.935 1.133e-06 ***
Residuals   196 3919.6  19.998                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Show solution

#------------------
# Explore group-based differences using Post-Hoc tests

# Load  libraries
library(multcomp)

# Tukey's HSD for Sprint_Time
tukey_assists <- TukeyHSD(aov(assists ~ group, data = df_2))
print("Tukey's HSD for Assists:")

[1] "Tukey's HSD for Assists:"

Show solution

print(tukey_assists)

  Tukey multiple comparisons of means
    95% family-wise confidence level

Fit: aov(formula = assists ~ group, data = df_2)

$group
               diff         lwr        upr     p adj
Team B-Team A  3.18  0.86247773  5.4975223 0.0026452
Team C-Team A -1.34 -3.65752227  0.9775223 0.4405080
Team D-Team A  2.38  0.06247773  4.6975223 0.0416409
Team C-Team B -4.52 -6.83752227 -2.2024777 0.0000059
Team D-Team B -0.80 -3.11752227  1.5175223 0.8077275
Team D-Team C  3.72  1.40247773  6.0375223 0.0002771

37.4 MANCOVA: Demonstration

The purpose of MANCOVA is to include an analysis of the effects of potential covariates in our data.

Step One: Load data

First, I’m going to clean the environment and re-load the first dataset.

rm(list=ls())
df <- read.csv('https://www.dropbox.com/scl/fi/xe2n37vfyck76i7hw4gnn/mancova_cycling.csv?rlkey=1khog9n5p1sh9evgsi6ovr5ta&dl=1')
df$X <- NULL
head(df) # display the first six rows

  cyclist_id training_program age speed elevation   time
1          1        Program_A  29 46.40   1517.92 207.91
2          2        Program_A  37 38.76   1288.62 202.16
3          3        Program_A  31 36.82   1284.83 223.19
4          4        Program_A  38 40.33   1765.40 222.11
5          5        Program_A  39 38.74   1195.32 186.89
6          6        Program_A  26 43.22   2086.59 219.49

In this cycling-based dataset, we have a similar structure to the previous dataset. There is a training_program (the IV, which is Program A or Program B) and a series of DVs (speed. elevation and time). We also have the athlete’s age.

We’re going to carry out a similar analysis to the previous one, asking the question “does training program have a significant effect on performance on any of the three dependent variables”? This time, we’re going to include age as a covariate, because it’s reasonable to suspect that it might be influential in how training program impacts performance.

Step Two: Examine potential interactions

When we conducted the MANOVA analysis above, we ignored the age variable. All we were interested in was whether the training program had a significant effect on any of the outcome variables.

We can visually inspect the impact of DV on the IVs:

# Time Boxplot
boxplot(time ~ training_program, data = df,
        main = "Time by Training Program",
        xlab = "Group", ylab = "Time (seconds)")

# Elevation Boxplot
boxplot(elevation ~ training_program, data = df,
        main = "Elevation by Training Program",
        xlab = "Group", ylab = "Elevation (ft)")

# Speed Boxplot
boxplot(speed ~ training_program, data = df,
        main = "Speed by Training Program",
        xlab = "Group", ylab = "Speed (km/H")

But it’s reasonable to assume that age might also be important in this association. That is where MANCOVA becomes useful; we want to examine covariates of our main predictor variable, in this case age.

Reminder: what’s a covariate?

A ‘covariate’ is a variable that is possibly predictive of the outcome under study (the dependent variable). In statistics, covariates can be ‘controlled’, which lets us isolate the effect of the primary variable(s) of interest (the independent variable/s).

Interaction plots are useful for visualising how the relationship between one independent variable and a dependent variable changes across levels of another independent variable or covariate.

# Interaction of Program, Age, and Speed
interaction.plot(df$training_program, df$age, df$speed,
                 main="Interaction of Program and Age on Speed",
                 xlab="Training Program", ylab="Sprint Time", col=c("red","blue","green"))

# Interaction of Program, Age, and Elevation
interaction.plot(df$training_program, df$age, df$elevation,
                 main="Interaction of Program and Age on Elevation",
                 xlab="Training Program", ylab="Elevation", col=c("red","blue","green"))

# Interaction of Program, Age, and Time
interaction.plot(df$training_program, df$age, df$time,
                 main="Interaction of Program and Age on Time",
                 xlab="Training Program", ylab="Time", col=c("red","blue","green"))

Here are some other ways you can visualise interaction effects:

library(ggplot2)

# Interaction plot for Speed as a function of Training Program and Age
ggplot(df, aes(x=training_program, y=speed, color=age)) +
  geom_point() +
  geom_line(aes(group=age)) +
  theme_minimal() +
  labs(title="Interaction of Training Program and Age on Speed",
       y="Speed (km/h)",
       x="Training Program")

df_long <- tidyr::pivot_longer(df, cols=c(speed, elevation, time), names_to="PerformanceMetric", values_to="Value")

ggplot(df_long, aes(x=training_program, y=Value, color=age)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  facet_wrap(~PerformanceMetric, scales="free_y") +
  theme_minimal() +
  labs(title="Interactions Across Performance Metrics",
       y="Metric Value",
       x="Training Program")

`geom_smooth()` using formula = 'y ~ x'

library(lattice)

coplot(speed ~ elevation | cut(age, 2), data=df, panel=panel.smooth, xlab="Elevation (meters)", ylab="Speed (km/h)")

# Fit a linear model for demonstration
model <- lm(time ~ training_program * age, data = df)

# Create interaction plots based on model predictions
library(effects)
plot(allEffects(model))

Step Three: MANCOVA

The following example uses the the manova function from R’s base statistical package.

This lets us analyse the influence of training_program on our dependent variables while controlling for the effects of age, which is our covariate.

The command mancova_fit <- creates a new variable named mancova_fit that will store the results of the MANCOVA analysis.
The command manova() calls the function to perform the multivariate analysis of variance (MANOVA).

Note that the model specification is different from the MANOVA conducted above. This time, we’ve included a covariate (age), which makes the analysis a MANCOVA.

The command cbind(sprint_time, jump_height, throw_distance) specifies the dependent variables for the analysis, binding them together into a multivariate response. Here, sprint_time, jump_height, and throw_distance are the dependent variables being analysed.
~ separates the dependent variables from the independent variables and covariates in the formula.
training_program + age specifies the independent variable (training_program) and the covariate (age). The training_program variable is what the analysis aims to examine in terms of its effect on the three dependent variables, while age is included as a covariate to control for its potential confounding effect, allowing for a more accurate assessment of the impact of training_program.

# MANCOVA model with Age as a covariate
mancova_fit <- manova(cbind(time, elevation, speed) ~ training_program + age, data = df)

# Viewing the Results
summary(mancova_fit)

                  Df   Pillai approx F num Df den Df  Pr(>F)  
training_program   2 0.101563  2.58574      6    290 0.01863 *
age                1 0.008266  0.40006      3    144 0.75316  
Residuals        146                                          
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.aov(mancova_fit) # For univariate effects

 Response time :
                  Df Sum Sq Mean Sq F value Pr(>F)
training_program   2    495  247.53  0.6780 0.5092
age                1      8    7.81  0.0214 0.8839
Residuals        146  53303  365.09               

 Response elevation :
                  Df   Sum Sq Mean Sq F value Pr(>F)
training_program   2    39282   19641  0.2105 0.8104
age                1       63      63  0.0007 0.9793
Residuals        146 13621890   93301               

 Response speed :
                  Df  Sum Sq Mean Sq F value   Pr(>F)   
training_program   2  200.97 100.483  6.7063 0.001635 **
age                1   16.91  16.909  1.1285 0.289845   
Residuals        146 2187.57  14.983                    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Step Four: Interpretation

The MANCOVA output provides information on how the independent variables (both the grouping variable and the covariate) affect the combined dependent variables, controlling for the effects of the covariates.

On the first line of the output, we can see that training program membership has a signficant effect on at least one of the dependent variables.
However, age does not seem to be a significant covariate. Pillai’s Trace is a good general-purpose test. A significant p-value indicates that there is a statistically significant difference in the combined dependent variables across the groups, after controlling for the covariate(s).

In the example above, the p-value for Pillai’s Trace is > 0.05, which suggests that, overall, age does not play a significant role in the association between training program membership and any of the dependent variables.

Moving down, we can see that training program has a significant effect on speed. age is not significant, confirming that it does not play any role in mediating the relationship between program membership and speed.

Multiple Comparisons: If you further explore specific group differences with post-hoc tests, remember to adjust for multiple comparisons to control the Type I error rate.

37.5 MANCOVA: Practice

Re-load the first dataset.

rm(list=ls())
df <- read.csv('https://www.dropbox.com/scl/fi/jjh1sohgsv590f04x75pv/29_01.csv?rlkey=vhlxu0pziwymec3n7fiki2x1x&dl=1')
df$X <- NULL
head(df) # display the first six rows

  athlete_id training_program gender age sprint_time jump_height throw_distance
1          1        Program_A   Male  20       11.79        1.81          60.15
2          2        Program_A   Male  20       11.77        1.72          30.07
3          3        Program_A   Male  20       11.33        1.83          45.73
4          4        Program_A Female  22        9.99        2.05          51.17
5          5        Program_A   Male  20       10.88        2.80          41.07
6          6        Program_A Female  22       10.72        1.96          53.34

Now, conduct a MANOVA that explores whether age is a significant covariate on any of the outcome variables (sprint_time, jump_height, throw_distance).

You should be able to report:

Whether training program has any significant impact on any of the outcome variables.
Whether any of these findings (significant or insignificant) appear to be affected by the athlete’s age.

#| code-fold: true
#| code-summary: Show solution


# Interaction of Program, Age, and Sprint_Time
interaction.plot(df$training_program, df$age, df$sprint_time,
                 main="Interaction of Program and Age on Sprint Time",
                 xlab="Training Program", ylab="Sprint Time", col=c("red","blue","green"))

# Interaction of Program, Age, and Jump_Height
interaction.plot(df$training_program, df$age, df$jump_height,
                 main="Interaction of Program and Age on Jump Height",
                 xlab="Training Program", ylab="Jume Height", col=c("red","blue","green"))

# Interaction of Program, Age, and Throw_Distance
interaction.plot(df$training_program, df$age, df$throw_distance,
                 main="Interaction of Program and Age on Throw Distance",
                 xlab="Training Program", ylab="Throw Distance", col=c("red","blue","green"))

### MANCOVA

# MANCOVA model with Age as a covariate
mancova_fit <- manova(cbind(sprint_time, jump_height, throw_distance) ~ training_program + age, data = df)

# Viewing the Results
summary(mancova_fit)

                 Df  Pillai approx F num Df den Df    Pr(>F)    
training_program  2 0.39321   7.7494      6    190 1.821e-07 ***
age               1 0.05265   1.7415      3     94    0.1638    
Residuals        96                                             
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

summary.aov(mancova_fit) # For uni

 Response sprint_time :
                 Df Sum Sq Mean Sq F value    Pr(>F)    
training_program  2 51.420 25.7098 29.3759 1.113e-10 ***
age               1  1.225  1.2248  1.3995    0.2397    
Residuals        96 84.019  0.8752                      
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

 Response jump_height :
                 Df  Sum Sq  Mean Sq F value Pr(>F)
training_program  2  0.0734 0.036720  0.1452 0.8650
age               1  0.0070 0.007045  0.0279 0.8678
Residuals        96 24.2796 0.252913               

 Response throw_distance :
                 Df  Sum Sq Mean Sq F value  Pr(>F)  
training_program  2   108.3   54.16  0.5154 0.59892  
age               1   428.0  428.03  4.0731 0.04636 *
Residuals        96 10088.3  105.09                  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1